Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that sets the state-of-the-art on many knowledge-intensive NLP tasks. However, FiD suffers from very expensive inference. We show that the majority of inference time results from memory bandwidth constraints in the decoder, and propose two simple changes to the FiD architecture to speed up inference by 7x. The faster decoder inference then allows for a much larger decoder. We denote FiD with the above modifications as FiDO, and show that it strongly improves performance over existing FiD models for a wide range of inference budgets. For example, FiDO-Large-XXL performs faster inference than FiD-Base and achieves better performance than FiD-Large.
translated by 谷歌翻译
尽管近年来在多机构增强学习(MARL)方面取得了重大进展,但复杂领域的协调仍然是一个挑战。 MARL的工作通常专注于解决代理与环境中所有其他代理和实体互动的任务;但是,我们观察到现实世界任务通常由几个局部代理相互作用(子任务)的几个隔离实例组成,并且每个代理都可以有意义地专注于一个子任务,以排除环境中其他所有内容。在这些综合任务中,成功的策略通常可以分解为两个决策级别:代理人分配给特定的子任务,并且每个代理人仅针对其指定的子任务有效地采取行动。这种分解的决策提供了强烈的结构感应偏见,大大降低了代理观察空间,并鼓励在训练期间重复使用和组成子任务特异性策略,而不是将子任务的每个新组成视为独特的。我们介绍了ALMA,这是一种利用这些结构化任务的一般学习方法。阿尔玛同时学习高级子任务分配策略和低级代理政策。我们证明,阿尔玛(Alma)在许多具有挑战性的环境中学习了复杂的协调行为,表现优于强大的基准。 Alma的模块化还使其能够更好地概括为新的环境配置。最后,我们发现,尽管ALMA可以整合受过训练的分配和行动策略,但最佳性能仅通过共同训练所有组件才能获得。我们的代码可从https://github.com/shariqiqbal2810/alma获得
translated by 谷歌翻译
已显示通用非结构化神经网络在分布外的组成概述上挣扎。通过示例重组的组成数据增强已经转移了一些关于组成性的关于多个语义解析任务的黑盒神经模型的先前知识,但这通常需要特定于任务的工程或提供有限的收益。我们使用称为组成结构学习者(CSL)的型号提供更强大的数据重组方法。 CSL是一种具有拟同步无线语法骨干的生成模型,我们从训练数据中诱导。我们从CSL中进行重组的例子,并将其添加到预先训练的序列到序列模型(T5)的微调数据中。该程序有效地将大多数CSL的组成偏差转移到T5以进行诊断任务,并导致模型比在两个真实世界的组成泛化任务上的T5-CSL集合更强。这导致新的最先进的性能,这些挑战性的语义解析任务需要泛化自然语言变异和元素的新组成。
translated by 谷歌翻译
在学习动作识别中,模型通常预先接受对象识别,例如图像,例如想象成,稍后在与视频的目标动作识别上微调。这种方法造成了良好的经验性能,特别是最近的基于变压器的视频架构。虽然最近许多作品旨在为行动识别设计更先进的变压器架构,但如何训练视频变压器的努力。在这项工作中,我们探索了几种培训范式并提出了两个结果。首先,视频变压器受益于各种视频数据集和标签空间的联合培训(例如,动力学是关注的,而某些东西是以运动为中心的)。其次,通过进一步与图像共同训练(作为单帧视频),视频变换器学习更好的视频表示。我们将这种方法作为用于行动识别的共同培训视频和图像(封面)。特别是,当基于时序形式的架构上的ImageNet-21k上掠夺时,盖子将动力学-400的前1个精度提高2.4%,动力学-600以2.3%,有些东西-V2达2.3%。当以前最先进的较大刻度图像数据集预先磨削时,覆盖覆盖在动力学-400(87.2%),动力学-600(87.9%),动力学-700(79.8%),有些内容达到最佳结果(70.9%),和时刻 - 时间(46.1%),具有简单的时空视频变压器。
translated by 谷歌翻译
神经网络模型通常概括到不匹配的域或分布不符。在NLP中,特别是当预期模型概括为合作的模型,即熟悉词汇和建筑的新组合时,尤其产生这个问题。我们调查促进从一个组成任务转移到另一个组成任务的学习的学习陈述:模型的代表和任务特定层在预先驾驶任务上具有不同的培训,使得它们概括为需要合成性的不匹配分裂。我们将此方法应用于语义解析,使用三个非常不同的数据集,COG,地理信息集和扫描,作为FineTuning和目标任务交替使用。我们的方法显着改善了在目标任务的测试组上的基线上的组成概括,在微调期间被列出。消融研究表征了所提出的算法中主要步骤的效用,并支持我们的假设。
translated by 谷歌翻译
Learning with limited data is a key challenge for visual recognition. Many few-shot learning methods address this challenge by learning an instance embedding function from seen classes and apply the function to instances from unseen classes with limited labels. This style of transfer learning is task-agnostic: the embedding function is not learned optimally discriminative with respect to the unseen classes, where discerning among them leads to the target task. In this paper, we propose a novel approach to adapt the instance embeddings to the target classification task with a set-to-set function, yielding embeddings that are task-specific and are discriminative. We empirically investigated various instantiations of such set-to-set functions and observed the Transformer is most effective -as it naturally satisfies key properties of our desired model. We denote this model as FEAT (few-shot embedding adaptation w/ Transformer) and validate it on both the standard few-shot classification benchmark and four extended few-shot learning settings with essential use cases, i.e., cross-domain, transductive, generalized few-shot learning, and low-shot learning. It archived consistent improvements over baseline models as well as previous methods, and established the new stateof-the-art results on two benchmarks.
translated by 谷歌翻译
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only the image-level labels. Most of the existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet ignore the co-occurrence confounder of the object and context (e.g., fish and water), which makes the model inspection hard to distinguish object boundaries. Besides, the use of CAM also brings a dilemma problem that the classification and localization always suffer from a performance gap and can not reach their highest accuracy simultaneously. In this paper, we propose a casual knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and addressing the dilemma problem between classification and localization performance.
translated by 谷歌翻译
In the scenario of black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries for attacking each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework by training a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-train procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta-generator on a white-box surrogate model, then transfer it to help the attack against the target model. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance, which is verified by extensive experiments.
translated by 谷歌翻译
Differentiable Architecture Search (DARTS) has attracted considerable attention as a gradient-based Neural Architecture Search (NAS) method. Since the introduction of DARTS, there has been little work done on adapting the action space based on state-of-art architecture design principles for CNNs. In this work, we aim to address this gap by incrementally augmenting the DARTS search space with micro-design changes inspired by ConvNeXt and studying the trade-off between accuracy, evaluation layer count, and computational cost. To this end, we introduce the Pseudo-Inverted Bottleneck conv block intending to reduce the computational footprint of the inverted bottleneck block proposed in ConvNeXt. Our proposed architecture is much less sensitive to evaluation layer count and outperforms a DARTS network with similar size significantly, at layer counts as small as 2. Furthermore, with less layers, not only does it achieve higher accuracy with lower GMACs and parameter count, GradCAM comparisons show that our network is able to better detect distinctive features of target objects compared to DARTS.
translated by 谷歌翻译
We study estimation and testing in the Poisson regression model with noisy high dimensional covariates, which has wide applications in analyzing noisy big data. Correcting for the estimation bias due to the covariate noise leads to a non-convex target function to minimize. Treating the high dimensional issue further leads us to augment an amenable penalty term to the target function. We propose to estimate the regression parameter through minimizing the penalized target function. We derive the L1 and L2 convergence rates of the estimator and prove the variable selection consistency. We further establish the asymptotic normality of any subset of the parameters, where the subset can have infinitely many components as long as its cardinality grows sufficiently slow. We develop Wald and score tests based on the asymptotic normality of the estimator, which permits testing of linear functions of the members if the subset. We examine the finite sample performance of the proposed tests by extensive simulation. Finally, the proposed method is successfully applied to the Alzheimer's Disease Neuroimaging Initiative study, which motivated this work initially.
translated by 谷歌翻译